Abstract- Data science is an interdisciplinary 
field that extracts knowledge and insights from 
structured and unstructured data, using 
scientific methods, data mining techniques, 
machine-learning algorithms, and big data. The 
healthcare industry generates large datasets of 
useful information on patient demography, 
treatment plans, results of medical examinations, 
insurance, etc. The data collected from 
Internet of Things (IoT) devices attract the 
attention of data scientists. Data science 
helps to process, manage, analyze, and 
assimilate the large quantities of fragmented, 
structured and unstructured data created by 
healthcare systems. This data requires effective 
management and analysis to acquire factual 
results. The process of data cleansing, data 
mining, data preparation, and data analysis used 
in healthcare applications is reviewed and 
discussed in the article. The article provides an 
insight into the status and prospects of big data 
analytics in healthcare, highlights the 
advantages, describes the frameworks and 
techniques used, briefs about the challenges 
faced currently and discusses viable solutions. 
Data science and big data analytics can provide 
practical insights and aid strategic 
decision-making concerning the health 
system. They help build a comprehensive view of 
patients, consumers, and clinicians. Data-driven 
decision-making opens up new possibilities to 
boost healthcare quality. 


Keywords: Big data, Data analytics, Data mining, 
Healthcare, Healthcare informatics. 


I. INTRODUCTION 
The evolution of the digital era has led to the 
confluence of healthcare and technology, resulting in 
the emergence of newer data-related applications [1]. 
The health care sector generates voluminous amounts of 
clinical data, such as the Electronic Health Records 
(EHR) of patients, prescriptions, clinical reports, 
information about the purchase of medicines, medical 
insurance-related data, investigations, and laboratory 
reports, so there lies an immense opportunity to 
analyze and study these data using recent technologies 
[2]. The huge volume of data can be pooled together and 
analyzed effectively using machine-learning algorithms. 
Analyzing the details and understanding the patterns in 
the data can help in better decision-making, resulting 
in a better quality of patient care. It can help in 
understanding trends to improve the outcomes of medical 
care, life expectancy, early detection and 
identification of disease at an initial stage, and the 
required treatment at an affordable cost [3]. A Health 
Information Exchange (HIE) can be implemented, which 
will help in extracting clinical information from 
various distinct repositories and merging it into a 
single health record per person that all care providers 
can access securely. Hence, organizations associated 
with healthcare must attempt to procure the available 
tools and infrastructure to make use of big data, which 
can augment revenue and profits, establish better 
healthcare networks, and help them stand apart to reap 
significant benefits [4, 5]. Data mining techniques can 
create a shift from conventional medical databases to a 
knowledge-rich, evidence-based healthcare environment 
in the coming decade. 
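
The record-merging idea behind a Health Information Exchange can be illustrated with a minimal Python sketch. It assumes that every repository identifies patients with a shared `patient_id` key; the repository names, field names, and values below are hypothetical and are not taken from any real HIE standard or from the reviewed studies.

```python
from collections import defaultdict


def merge_patient_records(repositories):
    """Merge per-repository entries into one combined record per patient.

    `repositories` maps a repository name to a list of entries; each entry
    is a dict with a shared `patient_id` key (an assumption of this sketch).
    """
    merged = defaultdict(lambda: defaultdict(list))
    for repo_name, entries in repositories.items():
        for entry in entries:
            record = merged[entry["patient_id"]]
            for field, value in entry.items():
                if field != "patient_id":
                    # Keep the source repository so each value stays traceable.
                    record[field].append((repo_name, value))
    return {pid: dict(fields) for pid, fields in merged.items()}


# Hypothetical repositories holding fragments of the same patient's data.
repositories = {
    "lab": [{"patient_id": "P001", "hba1c": 6.9}],
    "pharmacy": [
        {"patient_id": "P001", "prescription": "metformin"},
        {"patient_id": "P002", "prescription": "atorvastatin"},
    ],
}
records = merge_patient_records(repositories)
print(records["P001"])
```

A real exchange would additionally need patient matching across identifiers, access control, and audit logging, none of which this sketch attempts.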

Big data and its utility in healthcare and medical 
sciences have become more critical with the dawn of 
the social media era (platforms such as Facebook 
and Twitter) and smartphone apps that can monitor 
personal health parameters using sensors and 
analyzers [6, 7]. The role of data mining is to 
improve the use of stored user information to provide 
superior treatment and care. This review article 
provides an insight into the advantages and 
methodologies of big data usage in health care 
systems. It highlights the voluminous data 
generated in these systems, their qualities, 
possible security-related problems, data 
handling, and how analytics supports gaining 
significant insight into these data sets. 

Search strategy 
A non-systematic review of all English-language 
literature on data science and big data in healthcare 
published in the last decade (2010-2020) was conducted 
in November 2020 using MEDLINE, Scopus, EMBASE, and 
Google Scholar. Our search strategy involved creating a 
search string based on a combination of keywords. They 
were: “Big Data,” “Big Data Analytics,” “Healthcare,” 
“Artificial Intelligence,” “AI,” “Machine Learning,” 
“ML,” “ANN,” “Convolutional Networks,” “Electronic 
Health Records,” “EHR,” “EMR,” “Bioinformatics,” and 
“Data Science.” We included original articles 
published in English. 
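
As an illustration of how such a keyword combination can be turned into a search string, the following Python sketch joins synonyms with OR and combines two groups with AND. The split into a "method" group and a "domain" group, and the Boolean operators themselves, are assumptions made for this example; the exact string used for the review is not reproduced here.

```python
def build_search_string(method_terms, domain_terms):
    """Join synonyms with OR inside a group and combine the groups with AND."""

    def group(terms):
        return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"

    return group(method_terms) + " AND " + group(domain_terms)


# Keywords from the search strategy, grouped by assumption for this sketch.
method_terms = ["Big Data", "Big Data Analytics", "Artificial Intelligence",
                "Machine Learning", "Data Science"]
domain_terms = ["Healthcare", "Electronic Health Records", "EHR", "EMR",
                "Bioinformatics"]

query = build_search_string(method_terms, domain_terms)
print(query)
```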

Inclusion criteria 

1. Articles on big data analytics, data science, 
and AI. 

2. Full-text original articles on all aspects 
of application of data science in medical 
sciences. 

Exclusion criteria 

1. Commentaries, reviews, and articles with no 
full-text context and book chapters. 

2. Animal, laboratory, or cadaveric studies. 

The literature review was performed as per the 
abovementioned strategy. Titles and abstracts were 
evaluated and screened, and the full text was reviewed 
for the chosen articles that satisfied the inclusion 
criteria. Furthermore, the authors manually reviewed 
the reference lists of the selected articles to screen 
for any additional work of interest. Disagreements 
about eligibility were resolved by consensus after 
discussion among the authors. 

Medical care as a repository for big data 

Healthcare is a multilayered system 
developed specifically for preventing, 
diagnosing, and treating diseases. The key 
elements of medical care are health 
practitioners (physicians and nurses), 
healthcare facilities (which include clinics, drug 
delivery centers, and other testing or treatment 
technologies), and a funding agency that funds the 
former. Health care practitioners belong to different 
fields of health such as dentistry, pharmacy, 
medicine, nursing, psychology, allied health sciences, 
and many more. 

Depending on the severity of the cases, health 
care is provided at many levels. In all these stages, 
health practitioners need different forms of 
information such as the medical history of the patient 
(data related to medication and prescriptions), clinical 
data (such as data from laboratory assessments), 
and other personal or private medical data. The usual 
practice for a clinic, hospital, or patient has been to 
retain these medical documents either as written notes 
or as printed reports [11]. 

The clinical case records preserve the incidence 
and outcome of disease in a person's body as a tale 
in the family, and the doctor plays an integral role in 
this tale [12]. With the emergence of electronic 
systems and their capacity, digitizing medical exams, 
health records, and investigations is a common 
procedure today. In 2003, the Institute of Medicine, a 
division of the National Academies of Sciences, 
Engineering, and Medicine, coined the term "Electronic 
Health Records" for an electronic portal that saves the 
records of patients. Electronic health records (EHRs) 
are automated medical records relating to an 
individual's physical or mental health, or significant 
reports, that are saved in an electronic system and 
used to record, send, receive, store, and retrieve 
information and to connect medical personnel and 
patients with medical services [13]. 

Open-source big data platforms 

Loading vast volumes of big data into the storage of a 
single machine is inefficient, even for the most 
powerful computers. Hence, the only logical approach to 
processing large quantities of big data available in a 
complex form is to spread and process it across several 
parallel connected nodes. Nevertheless, the volume of 
the data is typically so high that a large number of 
computing machines are needed to distribute and finish 
processing in a reasonable period. Working with 
thousands of nodes involves coping with issues related 
to parallelizing the computation, distributing the 
data, and managing failures. Table 1 lists a few 
open-source big data platforms and their utilities for 
data scientists. 
Table 1 Open-source big data platforms and their utilities 

Big data tools      Utilities 

Apache Hadoop       It is designed to scale up to thousands of machines from single servers, each of which offers local storage 
                    The framework enables users to easily build and validate distributed structures, distributes data, and operates across machines automatically 

Apache Spark        The Hadoop Distributed File System (HDFS) and other data stores are flexible to work with 
                    Spark offers integrated Application Program Interfaces (APIs) which enable users to write apps in different languages 

Apache Cassandra    Cassandra is highly flexible and can add additional hardware that can handle more data and users on demand 
                    Cassandra adapts to all possible data types such as unstructured, structured, and semi-structured, supporting features such as Atomicity, Consistency, Isolation, and Durability (ACID) 

Apache Storm        In several cases, Apache Storm is easy to integrate with any programming language, with real-time analytics, online machine learning, and computation 
                    Apache Storm uses parallel calculations which run across a machine cluster 

RapidMiner          RapidMiner provides a variety of products for a new process of data mining 
                    It provides an integrated data preparation environment, machine learning, text mining, visualization, predictive analysis, application development, prototype validation and implementation, statistical modeling, and deployment 

Cloudera            Users can spin clusters, terminate them, and only pay for what they need 
                    Cloudera Enterprise can be deployed and run on AWS and Google Cloud Platforms by users 
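
The split-process-combine approach described at the start of this section can be sketched in a few lines of Python. This is only an illustration of the idea: a local thread pool stands in for cluster nodes, the function names are hypothetical, and the diagnosis-code strings are made-up sample data. Platforms such as Hadoop or Spark additionally handle data placement and node failures, which this sketch does not.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Split the records into at most n_chunks pieces of near-equal size."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_codes(chunk):
    """Map step: each worker counts the codes in its own chunk."""
    return Counter(chunk)


def distributed_count(records, n_workers=4):
    """Count codes by processing chunks in parallel and combining the results."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_codes, chunks))
    # Reduce step: merge the partial counts into one total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Made-up sample of diagnosis codes, used only to exercise the functions.
codes = ["E11", "I10", "E11", "J45", "I10", "E11"]
totals = distributed_count(codes, n_workers=3)
print(totals.most_common())
```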


DATA MINING 

Data types can be classified based on their 
nature, source, and data collection methods 
[14]. Data mining techniques include data 
grouping, data clustering, data correlation, 
mining of sequential patterns, regression, and 
data storage. There are several sources to 
obtain healthcare-related data (Fig. 1). The most 
commonly used type (77%) is the data 
generated by humans (HG data), which includes 
Electronic Medical Records (EMR), Electronic Health 
Records (EHR), and Electronic Patient Records 
(EPR). Online data through Web Service (WS) is 
considered the second largest form of data (11%) 
due to the daily increase in the number of people using 
social media and current digital development in the 
medical sector [15]. Recent advances in Natural 
Language Processing (NLP)-based methodologies are also 
making WS simpler to use [16]. The other data forms, 
such as Sensor Data (SD), Big Transactional Data (BTD), 
and Biometric Data (BM), make up around 12% of overall 
data use, but the growing prominence and market for 
wearable personal health monitoring devices [17] may 
increase the need for SD and BM data. 

Fig. 1 Sources of big data in healthcare (legible labels: Patient; agencies; Public; Electronic Health Records (EHR); Smartphones) 


DISEASE SURVEILLANCE 

Disease surveillance involves the perception of the 
disease and the understanding of its condition, 
etiology (the manner of causation of a disease), and 
prevention (Fig. 3). Information obtained with the help 
of EHRs and the Internet has huge prospects for disease 
analysis. The various surveillance methods would aid 
the planning of services, evaluation of treatments, 
priority setting, and the development of health policy 
and practice. 

[Fig. 3; legible labels: Apprehending, Perception] 

MEDICAL SIGNAL ANALYTICS 

Telemetry and the devices for monitoring physiological 
parameters generate large amounts of data. The data 
generated are generally retained only for a short 
duration, and thus extensive research into the produced 
data is neglected. However, advancements in data 
science in the field of healthcare attempt to ensure 
better management of data and provide enhanced patient 
care [20-23]. The use of continuous waveforms in health 
records, together with information generated through 
the application of analytical disciplines (e.g., 
statistical, quantitative, contextual, cognitive, 
predictive, etc.), can drive comprehensive care 
decision-making. Apart from data acquisition, a 
streaming ingestion platform is needed that can handle 
a set of waveforms at various fidelity rates. The 
integration of this waveform data with the EHR's static 
data is an important component for giving an analytics 
engine situational as well as contextual awareness. 
Enriching the data consumed by analytics will not just 
make the method more reliable but will also help in 
balancing the sensitivity and specificity of predictive 
analytics. The choice of signal-processing technique 
must rely mainly on the kind of disease population 
under observation. Various signal-processing techniques 
can be used to derive a large number of target 
properties that are later consumed by a pre-trained 
machine-learning model to provide actionable insight. 
Such insights may be analytical, prescriptive, or 
predictive, and they can furthermore be built to 
activate other mechanisms such as alarms and physician 
notifications. Maintaining these continuous 
waveform-based data in perfect harmony with specific 
data obtained from the remaining sources, to find the 
appropriate patient information for improving 
next-generation diagnosis and treatment, can be a 
daunting task [24]. Several technological criteria and 
specifications at the framework, analytical, and 
clinical levels need to be planned and implemented for 
the bedside implementation of these systems in medical 
setups. 
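
The balance between sensitivity and specificity mentioned above can be made concrete with a short Python sketch. It assumes that a model has already produced a risk score per patient and that an alarm fires when the score reaches a threshold; the scores, labels, and thresholds are invented for illustration and do not come from any of the cited studies.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when alarms fire at score >= threshold.

    labels use 1 for a true event and 0 for no event.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        alarm = score >= threshold
        if alarm and label:
            tp += 1
        elif alarm and not label:
            fp += 1
        elif not alarm and label:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Invented model scores and ground-truth labels for six monitoring windows.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0, 0, 1, 1, 1, 0]

for threshold in (0.3, 0.5):
    print(threshold, sensitivity_specificity(scores, labels, threshold))
```

On this toy data, the lower threshold catches every event at the cost of one false alarm, while the higher threshold removes the false alarm but misses one event, which is the trade-off an alarm system has to tune.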


DATA STORAGE AND CLOUD COMPUTING 

Data warehousing and cloud storage are 
primarily used for storing the increasing amount 
of electronic patient-centric data [25, 26] safely 
and cost-effectively to enhance medical 
outcomes. Besides medical purposes, data 
storage is utilized for purposes of research, 
training, education, and quality control. Users 
can also extract files from a repository 
containing the radiology results by using 
keywords following the predefined patient 
privacy policy. 


COST AND QUALITY OF HEALTHCARE 
AND UTILIZATION OF RESOURCES 

The migration of imaging reports to electronic medical 
recording systems offers tremendous potential for 
advancing research and practice in radiology through 
the continuous updating, incorporation, and exchange of 
a large volume of data. However, the heterogeneity in 
how these data can be formatted still poses major 
challenges. The overall objective of NLP is to 
translate natural human language into structured data 
with a standardized set of value choices that software 
can easily manipulate, for example, by dividing it into 
subsections or searching for the presence or absence of 
a finding [27]. Greaves et al. [28] analyzed the 
sentiment of patients' online comments about their 
overall experience (computationally dividing them into 
categories such as positive, negative, and neutral) to 
predict healthcare quality. They found an agreement 
above 80% between online platform sentiment analysis 
and conventional paper-based quality prediction surveys 
(e.g., cleanliness, positive conduct, recommendation). 
This newer approach can be a cost-effective alternative 
to conventional healthcare surveys and studies. 
Physicians' overuse of screening and testing often 
leads to surplus data and excess costs [29]. The 
present practice in pathology is restricted by the 
emphasis on illness. Zhuang et al. [29] compared the 
disease-based approach in conjunction with database 
reasoning and used a data mining technique to build an 
evidence-based decision support system that minimizes 
unnecessary testing and reduces the total expense of 
patient care. 
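
To illustrate what sorting patient comments into sentiment categories and measuring agreement with a survey can look like, here is a deliberately simple Python sketch. It uses a tiny hand-made word list, which is an assumption of this example and not the method used by Greaves et al.; the comments and survey labels are invented.

```python
import re

# Toy lexicons, hypothetical and far too small for real use.
POSITIVE = {"clean", "friendly", "helpful", "recommend", "excellent"}
NEGATIVE = {"dirty", "rude", "slow", "painful", "unhelpful"}


def classify_comment(text):
    """Label a comment as positive, negative, or neutral by counting cue words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def agreement_rate(predicted, reference):
    """Fraction of items where the predicted label matches the reference label."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


comments = [
    "The ward was clean and the staff were friendly.",
    "Rude reception and slow service.",
    "I had an appointment on Monday.",
]
predicted = [classify_comment(c) for c in comments]
survey_labels = ["positive", "negative", "positive"]  # invented reference labels
print(predicted, agreement_rate(predicted, survey_labels))
```

Published sentiment studies typically rely on trained classifiers rather than fixed word lists; the sketch only shows the shape of the comparison.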


Fig. 4 Role of big data in accelerating the treatment process (legible labels: Diagnose and Collect; Analyze and Store; Map & Match; Access and Compute; Refine; Treatment of the patient) 


PATIENT DATA MANAGEMENT 

Patient data management involves effective 
scheduling and the delivery of patient care during 
the period of a patient’s stay in a hospital. The 
framework of patient-centric healthcare is shown in 
Fig. 5. 
Daggy et al. [30] conducted a study on "no-shows," or 
missed appointments, which lead to underused clinical 
capacity. A logistic regression model was developed 
using electronic medical records to estimate patients' 
no-show probabilities, and the estimates were used to 
create clinical schedules that optimize the use of 
clinical capacity while limiting waiting times and 
clinic overtime. A 400-day clinical call-in process 
was simulated, and two schedules were developed per 
day: the conventional method, which assigns one patient 
per appointment slot, and the proposed method, which 
schedules patients according to no-show likelihood to 
balance patient waiting time, overtime, and income. If 
patient no-show models are combined with advanced 
programming approaches, more patients can be seen per 
day, thus enhancing clinical performance. Clinics 
should weigh the advantages of implementing planning 
software that includes such methodologies against the 
costs of no-shows [30]. A study conducted by Cubillas 
et al. [31] pointed out that patients who come for 
administrative purposes take less time than patients 
who come for health reasons. They also developed a 
statistical design for estimating the number of 
administrative visits. With a time saving of 21.73% 
(660,538 min), their model enhanced the scheduling 
system. Unlike patients who visit for administrative 
purposes, a few patients come very regularly for 
medical treatment and account for a significant share 
of the medical workload. Koskela et al. [32] used both 
supervised and unsupervised learning strategies to 
identify and cluster records; the supervised strategy 
performed well in one cluster, with 86% accuracy in 
distinguishing correct documents from incorrect ones, 
whereas the unsupervised technique failed. This 
approach can be applied to semi-automate the EMR entry 
system [32]. 
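
The way a logistic regression model turns patient attributes into a no-show probability, and how those probabilities inform overbooking, can be sketched as follows. The feature names, coefficients, and intercept are hypothetical values chosen for illustration; they are not the fitted model of Daggy et al.

```python
import math


def no_show_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(coef * x))))."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def expected_attendance(no_show_probs):
    """Expected number of patients who turn up among those booked in a slot."""
    return sum(1.0 - p for p in no_show_probs)


# Hypothetical coefficients; a real model would be fitted to EMR data.
COEFFICIENTS = {"prior_no_shows": 0.8, "lead_time_weeks": 0.3}
INTERCEPT = -2.0

# Two patients double-booked into the same slot.
slot_patients = [
    {"prior_no_shows": 0, "lead_time_weeks": 1},
    {"prior_no_shows": 3, "lead_time_weeks": 4},
]
probs = [no_show_probability(p, COEFFICIENTS, INTERCEPT) for p in slot_patients]
print(probs, expected_attendance(probs))
```

A scheduler could book a second patient into a slot when the expected attendance of the slot stays close to one, which is the intuition behind scheduling by no-show likelihood rather than one patient per slot.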
PRIVACY OF MEDICAL DATA AND FRAUD DETECTION 

The anonymization of patient data, the privacy of 
medical data, and fraud detection are crucial in 
healthcare. This demands efforts from data scientists 
to protect big data from hackers. Mohammed et al. [33] 
introduced a unique anonymization algorithm that works 
for both distributed and centralized anonymization and 
discussed the problems of privacy and security. To 
maintain data usefulness without any loss of data 
privacy, the researchers further proposed a model that 
performed far better than the traditional 
K-anonymization model. In addition, their algorithm 
could also deal with voluminous, multi-dimensional 
datasets. 
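
For readers unfamiliar with the traditional K-anonymization baseline mentioned above, the following Python sketch checks whether a table is k-anonymous and shows one simple generalization step. It illustrates the baseline idea only, not the algorithm of Mohammed et al.; the column names and rows are invented.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


def generalize_age(row, width=10):
    """Replace an exact age with a range such as '30-39'."""
    low = (row["age"] // width) * width
    return {**row, "age": f"{low}-{low + width - 1}"}


# Invented patient rows; 'age' and 'zip' act as quasi-identifiers.
rows = [
    {"age": 34, "zip": "560001", "diagnosis": "asthma"},
    {"age": 37, "zip": "560001", "diagnosis": "diabetes"},
]
print(is_k_anonymous(rows, ["age", "zip"], k=2))

generalized = [generalize_age(row) for row in rows]
print(is_k_anonymous(generalized, ["age", "zip"], k=2))
```

Generalizing values makes records indistinguishable within a group but also removes detail, which is the loss of data usefulness that improved anonymization models try to limit.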


MENTAL HEALTH 

According to the National Survey on Drug Use and 
Health (NSDUH), 52.2% of the total population in the 
United States (U.S.) was affected by either mental 
problems or drug addiction/abuse [38]. In addition, 
approximately 30 million people suffer from panic 
attacks and anxiety disorders [39]. Panagiotakopoulos 
et al. [40] developed a data analysis-focused treatment 
technique to help doctors manage patients with anxiety 
disorders. The authors constructed user models from 
static information, which includes personal details 
such as the individual's age, sex, body and skin types, 
and family details, and from dynamic information such 
as the context of stress, climate, and symptoms. For 
the first three services, relationships between 
different complex parameters were established, and the 
remaining one was mainly used to predict stress rates 
under various scenarios. This model was verified with 
the help of data collected from twenty-seven volunteers 
who were selected via an anxiety assessment survey. The 
applications of data analytics in the diagnosis, 
examination, or treatment of patients with mental 
health conditions are very different from using 
analytics to anticipate cancer or diabetes. In this 
case, the data context (static, dynamic, or 
non-observable environment) seems to be more important 
than data volume [39]. 


PUBLIC HEALTH 

Data analytics have also been applied to the detection 
of disease during outbreaks. Kostkova et al. [43] 
analyzed online records, based on behaviour patterns 
and media reporting, to identify the factors that 
affect public as well as professional search patterns 
related to disease outbreaks. They found distinct 
factors affecting the search patterns of public health 
agencies' professionals and of laypeople, with 
implications for targeted communications during 
emergencies and outbreaks. Rathore et al. [44] have 
suggested an emergency response unit using an IoT-based 
wireless network of wearable devices called body area 
networks (BANs). The system includes “intelligent 
construction,” a model that helps in processing and 
decision-making from the data obtained from the 
sensors. The system was able to process millions of 
users’ wireless BAN data to provide an emergency 
response in real time. 


PHARMACOVIGILANCE 

Pharmacovigilance requires tracking and identification 
of adverse drug reactions (ADRs) after launch to 
guarantee patient safety. The approximate social cost 
of ADR events reaches a billion dollars per year, which 
makes pharmacovigilance a significant aspect of the 
medical care system [46]. In a test of the ADR 
“hypersensitivity” for six anticancer agents, data 
mining of adverse event reports (AERs) revealed that 
paclitaxel might cause mild to lethal reactions and 
that docetaxel was linked with lethal reactions, while 
the remaining four drugs were not associated with 
hypersensitivity [47]. Harpaz et al. [46] disagreed 
with the theory that adverse events might be caused not 
just by a single medication but also by a mixture of 
synthetic drugs. A correlation between a minimum of one 
drug and two adverse events (AEs), or two drugs and one 
AE, was found in 84% of AER studies. Harpaz et al. [47] 
improved precision in the identification of ADRs by 
jointly considering several data sources. When using 
publicly available EHRs in conjunction with the AER 
studies of the FDA, they achieved a 31% (on average) 
increase in detection [45]. Dose-dependent ADRs were 
identified with the help of models built from 
structured as well as unstructured EHR data [48]. Of 
the top 5 ADR-related drugs, 4 were observed to be 
dose-related [49]. The use of unstructured text data in 
EHRs for pharmacovigilance has also been given priority 
[50]. ADRs are uncommon in conventional 
pharmacovigilance, though it is possible to get false 
signals while finding a connection between a drug and 
any potential ADRs. These false alarms can be avoided 
because there is already a list of potential ADRs that 
can be of great help in potential pharmacovigilance 
activities [18]. 
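
The studies above do not specify here which statistic was used to mine adverse event reports. As an illustration of how a drug-event signal can be quantified, the following Python sketch computes the proportional reporting ratio (PRR), one commonly used disproportionality measure that is swapped in for this example. The report counts are invented.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR = [a / (a + b)] / [c / (c + d)] for a 2x2 table of report counts.

    a: reports naming the drug of interest with the event
    b: reports naming the drug of interest without the event
    c: reports for all other drugs with the event
    d: reports for all other drugs without the event
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    return (a / (a + b)) / (c / (c + d))


# Invented counts: the event appears in 20 of 100 reports for the drug
# and in 50 of 1,000 reports for all other drugs.
prr = proportional_reporting_ratio(a=20, b=80, c=50, d=950)
print(round(prr, 2))
```

A ratio well above 1 suggests that the event is reported more often with the drug than with other drugs, but it is a screening signal rather than proof of causation, which is why false signals remain a concern.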


OVERCOMING THE LANGUAGE BARRIER 

Having electronic health records shared worldwide can 
be beneficial in analyzing and comparing disease 
incidence and treatments in different countries. 
However, every country would use its own language for 
data recording. This language barrier can be dealt with 
using multilingual language models, which would allow 
diversified opportunities for data science to 
proliferate and for models that personalize services to 
be developed. These models will be able to understand 
the semantics (the grammatical structure and rules of 
the language) along with the context (the general 
understanding of words in different contexts). For 
example: "I'll meet you at the river bank." "I have to 
deposit some money in my bank account." The word "bank" 
means different things in the two contexts, and a 
well-trained language model should be able to 
differentiate between the two. A cross-lingual language 
model trains on multiple languages simultaneously. Some 
of the cross-lingual language models include: 

* mBERT: the multilingual BERT, developed by the Google 
Research team. 
* XLM: a cross-lingual model developed by Facebook AI, 
which is an improvement over mBERT. 
* MultiFiT: a QRNN-based model developed by fast.ai 
that addresses challenges faced by low-resource 
language models. 


CHALLENGES 


Millions of data points are accessible for 
EHR-based phenotyping involving a large number of 
clinical elements inside the EHRs. As with sequence 
data, handling and controlling the complete data of 
millions of individuals would also become a major 
challenge [51]. The key challenges faced include: 

* The data collected is mostly either unorganized 
or inaccurate, which makes it difficult to gain 
insights from it. 
* The correct balance between preserving patient- 
centric information and ensuring the quality and 
accessibility of this data is difficult to decide. 
* Data standardization, maintaining privacy, and 
efficient storage and transfer require a lot of 
manpower to constantly monitor and make sure that the 
needs are met. 
* Integrating genomic data into medical studies is 
difficult due to the absence of standards for producing 
next-generation sequencing (NGS) data, handling 
bioinformatics, data deposition, and supporting 
medical decision-making [52]. 
* Language barriers arise when dealing with data 
recorded in different languages. 

FUTURE DIRECTIONS 

Healthcare services are constantly on the 
lookout for better options for improving the 
quality of treatment. They have embraced 
technological innovations with the intention of 
developing a better future. Big data is a revolution 
in the world of health care. The attitude of patients, 
doctors, and healthcare providers to care 
delivery has only just begun to transform. The 
uses of big data discussed here are just the tip of 
the iceberg. With the proliferation of data science and 
the advent of various data-driven applications, 
the health sector remains a leading provider of 
data-driven solutions for a better life and tailored 
services to its customers. Data scientists can 
gain meaningful insights into improving the 
productivity of pharmaceutical and medical 
services through the broad range of data available on 
the healthcare sector, including financial, 
clinical, R&D, administration, and operational 
details. 


CONCLUSION 

Larger patient datasets can be obtained from 
medical care organizations that include data 
from surveillance, laboratory, genomics, 
imaging, and electronic healthcare records. This 
data requires proper management and analysis 
to derive meaningful information. Long-term 
visions for self-management, improved patient 
care, and treatment can be realized by utilizing 
big data. Data Science can bring in instant 
predictive analytics that can be used to obtain 
insights into a variety of disease processes and 
deliver patient-centric treatment. It will help to 
improve the ability of researchers in the field of 
science, epidemiological studies, personalized 
medicine, etc. Predictive accuracy, however, is 
highly dependent on the efficient integration of data 
obtained from different sources so that the results 
can be generalized. Modern health organizations can 
revolutionize medical therapy and personalized 
medicine by integrating biomedical and health 
data. Data science can effectively handle, 
evaluate and interpret big data by creating new 
paths in comprehensive medical care. Funding 